PROMPI: a DiRAC RSE success story

DiRAC Science Day 2024

Miren Radia

University of Cambridge

Thursday 12 December 2024

Introduction

The team

Raphael Hirschi
Professor of Stellar Hydrodynamics and Nuclear Astrophysics
Keele University

Vishnu Varma
Research Associate in Theoretical Stellar Astrophysics
Keele University

Miren Radia
Research Software Engineer
University of Cambridge

Others

  • Federico Rizzuti, former PhD Student, Keele University
  • Caitlyn Chambers, new PhD student, Keele University

PROMPI

What does the code do?

  • PROMPI is a fluid dynamics code that is used to simulate complex hydrodynamic processes within stars.
  • Numerical methods:
    • Finite volume
    • Eulerian
    • Piecewise Parabolic Method (PPM) hydrodynamics scheme
  • Physics:
    • Fully compressible fluids
    • Nuclear burning
    • Convection/turbulence
  • Code:
    • Fortran
    • Parallelised via domain decomposition using MPI

Evolution of \(|\mathbf{v}|\) for a \(1024^3\) simulation of the Carbon-burning shell

Previous RSE work

What improvements had already been made to the code?

Over several DiRAC RSE projects, the code has been enhanced and modernised in several ways:

  • Acceleration on Nvidia GPUs using OpenACC
  • Fortran 77 → Modern free-form Fortran
  • Object-oriented design (Fortran 2003)
  • Legacy include statements and common blocks → Modules
  • Custom Makefile build system → CMake
  • Custom binary I/O format → HDF5
  • Regression tests and GitLab CI pipeline to run them

This project

Aims

What still needed to be done for the new code to be research-ready?

Despite the enhancements, there was still work that needed to be done before the group felt they could switch over:

  1. Consistency between the results on GPU and CPU.
  2. Optimal performance on the DiRAC systems the group uses (COSMA8 and Tursa).
  3. Porting and testing of the physics modules and initial conditions from the old version of the code that are needed to simulate specific scenarios.
  4. Resolving the poor scaling across multiple GPUs.

Work summary

What improvements were made to the code?

During the project, changes I made include:

  • Improvements and updates to the CMake build system.
  • Dependency software stack creation on Tursa and greenHPC (Keele local system).
  • Refactoring, updating and adding to the test and CI frameworks.
  • Fixing and refactoring the analysis/plotting Python scripts.
  • Significant refactoring of the MPI communication.
  • Fixing the HDF5 checkpoint and restart consistency.
  • Benchmarking and scaling analysis.

Improving MPI communication

The problem

What was causing such poor performance on GPUs?

Previously the code used:

  • Nvidia managed memory extension to OpenACC:
    • The runtime automatically migrates data between host (CPU) and device (GPU) as required.
  • MPI Derived Datatypes:
    • MPI_Type_vector to simplify halo/ghost cell exchange since this data is non-contiguous in memory, albeit regularly spaced.
  • Effectively blocking MPI calls:
    • MPI_Wait was called after every MPI_Irecv.
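This pattern can be sketched as follows. This is a minimal, illustrative example only, not PROMPI's actual code: all variable names, array extents and the ring topology are invented for the sketch.

```fortran
program halo_sketch
  use mpi
  implicit none
  integer, parameter :: nx = 16, ny = 16, nghost = 2, tag = 0
  double precision :: u(nx, ny)
  integer :: face_type, req, ierr, rank, nproc, left, right
  integer :: status(MPI_STATUS_SIZE)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)
  ! Invented 1D ring of neighbours for illustration.
  left  = mod(rank - 1 + nproc, nproc)
  right = mod(rank + 1, nproc)
  u = dble(rank)

  ! One x-face of halo cells: ny blocks of nghost contiguous values,
  ! spaced nx apart in memory (non-contiguous but regularly strided).
  call MPI_Type_vector(ny, nghost, nx, MPI_DOUBLE_PRECISION, face_type, ierr)
  call MPI_Type_commit(face_type, ierr)

  ! Non-blocking receive followed immediately by MPI_Wait:
  ! effectively a blocking exchange, one small message per face.
  call MPI_Irecv(u(1, 1), 1, face_type, left, tag, MPI_COMM_WORLD, req, ierr)
  call MPI_Send(u(nx - 2*nghost + 1, 1), 1, face_type, right, tag, &
                MPI_COMM_WORLD, ierr)
  call MPI_Wait(req, status, ierr)

  call MPI_Type_free(face_type, ierr)
  call MPI_Finalize(ierr)
end program halo_sketch
```

Because the wait immediately follows each receive, communication cannot overlap with computation, and with managed memory each strided face access can trigger its own small host-device migration.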

This combination resulted in many small host-device data migrations → bad for performance:

  • For a \(512^3\) test simulation running on 8 Tursa Nvidia A100s (2 nodes), > 90% of the walltime was spent in communication.

The solution

Benefits

Scaling

Other benefits

Any questions?